Objectives

  • Describe the purpose of an R package and the dplyr package.
  • Select certain columns in a dataframe with the dplyr function select.
  • Select certain rows in a dataframe according to filtering conditions with the dplyr function filter.
  • Link the output of one dplyr function to the input of another function with the ‘pipe’ operator %>%.
  • Add new columns to a dataframe that are functions of existing columns with mutate.
  • Use the split-apply-combine concept for data analysis.
  • Use summarize, group_by, and count to split a dataframe into groups of observations, apply a summary statistics for each group, and then combine the results.

Questions

  • How can I select specific rows and/or columns from a dataframe?
  • How can I combine multiple commands into a single command?
  • How can I create new columns or remove existing columns from a dataframe?

dplyr is a package for making tabular data wrangling easier by using a limited set of functions that can be combined to extract and summarize insights from your data.

Like readr, dplyr is a part of the tidyverse. These packages were loaded in R’s memory when we called library(tidyverse) earlier.

Note

The packages in the tidyverse, namely dplyr, tidyr and ggplot2 accept both the British (e.g. summarise) and American (e.g. summarize) spelling variants of different function and option names. For this lesson, we utilize the American spellings of different functions; however, feel free to use the regional variant for where you are teaching.

What is an R package?

The package dplyr provides easy tools for the most common data wrangling tasks. It is built to work directly with dataframes, with many common tasks optimized by being written in a compiled language (C++) (not all R packages are written in R!).

There are also packages available for a wide range of tasks including building plots (ggplot2, which we’ll see later), downloading data from the NCBI database, or performing statistical analysis on your data set. Many packages such as these are housed on, and downloadable from, the Comprehensive R Archive Network (CRAN) using install.packages. This function makes the package accessible by your R installation with the command library(), as you did with tidyverse earlier.

To easily access the documentation for a package within R or RStudio, use help(package = "package_name").

To learn more about dplyr after the workshop, you may want to check out this handy data transformation with dplyr cheatsheet.

Note

There are alternatives to the tidyverse packages for data wrangling, including the package data.table. See this comparison for example to get a sense of the differences between using base, tidyverse, and data.table.

Learning dplyr

To make sure everyone will use the same dataset for this lesson, we’ll read again the kings dataset that we downloaded earlier.

## load the tidyverse
library(tidyverse)

## load the data
kings <- read_csv("data/kings.csv", na = "NULL" ,n_max = 54)

## inspect the data
kings

## preview the data
# view(kings)

We’re going to learn some of the most common dplyr functions:

  • select(): subset columns
  • filter(): subset rows on conditions
  • mutate(): create new columns by using information from other columns
  • group_by() and summarize(): create summary statistics on grouped data
  • arrange(): sort results
  • count(): count discrete values

Selecting columns and filtering rows

To select columns of a dataframe, use select(). The first argument to this function is the dataframe (kings), and the subsequent arguments are the columns to keep, separated by commas. Alternatively, if you are selecting columns adjacent to each other, you can use a : to select a range of columns, read as “select columns from ___ to ___.” You may have done something similar in the past using subsetting. select() is essentially doing the same thing as subsetting, using a package (dplyr) instead of R’s base functions.

# to select columns throughout the dataframe
select(kings, Name, Start_year, End_year)
# to do the same thing with subsetting
kings[c("Name", "Start_year", "End_year")]
# to select a series of connected columns
select(kings, Name:End_year)

To choose rows based on specific criteria, we can use the filter() function. The argument after the dataframe is the condition we want our final dataframe to adhere to (e.g. House is Gorm):

# filters observations where House is "Gorm"
filter(kings, House == "Gorm")
# A tibble: 7 × 11
  Name         House Start…¹ End_y…² Birth…³ Death…⁴ Gender Dynasty Source Birth…⁵ Death…⁶
  <chr>        <chr>   <dbl>   <dbl>   <dbl>   <dbl> <chr>  <chr>   <chr>  <chr>   <chr>  
1 Gorm den Ga… Gorm      936     958     908     958 M      Jellin… https… "908"   "958"  
2 Harald 1. B… Gorm       NA      NA     936     985 M      Jellin… https… "936"   "987"  
3 Toke_Gormsen Gorm      985     986      NA     986 M      Jellin… https… ""      ""     
4 Svend 1. Tv… Gorm       NA      NA      NA    1014 M      Jellin… https… "17/04… "03/02…
5 Harald 2.    Gorm     1014    1018     963      NA M      Jellin… https… "994"   "1018" 
6 Knud 1. den… Gorm     1018    1035      NA    1035 M      Jellin… https… "995"   "12/11…
7 Hardeknud    Gorm     1035    1042     995    1042 M      Jellin… https… "1018"  "08/06…
# … with abbreviated variable names ¹​Start_year, ²​End_year, ³​Birth_year, ⁴​Death_year,
#   ⁵​BirthDMY, ⁶​DeathDMY

We can also specify multiple conditions within the filter() function. We can combine conditions using either “and” or “or” statements. In an “and” statement, an observation (row) must meet every criteria to be included in the resulting dataframe. To form “and” statements within dplyr, we can pass our desired conditions as arguments in the filter() function, separated by commas:

# filters observations with "and" operator (comma)
# output dataframe satisfies ALL specified conditions
filter(kings, House == "Estridsen",
                   End_year > 1100,
                    End_year < 1210)
# A tibble: 7 × 11
  Name         House Start…¹ End_y…² Birth…³ Death…⁴ Gender Dynasty Source Birth…⁵ Death…⁶
  <chr>        <chr>   <dbl>   <dbl>   <dbl>   <dbl> <chr>  <chr>   <chr>  <chr>   <chr>  
1 Erik 1. Eje… Estr…    1095    1103    1050    1103 M      Jellin… "http… "1055"  "10/07…
2 Niels        Estr…    1104    1134    1055    1134 M      Jellin… "http… "1134"  "25/06…
3 Erik 2. Emu… Estr…    1134    1137    1065    1137 M      Jellin… "http… "1100"  "18/09…
4 Erik 3. Lam  Estr…    1137    1146    1100    1146 M      Jellin… "http… "1120"  "27/08…
5 Svend 3.; K… Estr…    1146    1157    1120      NA M      Jellin… ""     ""      ""     
6 Valdemar 1.… Estr…    1157    1182    1131    1182 M      Jellin… "http… "14/01… "12/05…
7 Knud 4.      Estr…    1182    1202    1163    1202 M      Jellin… "http… "1163"  "12/11…
# … with abbreviated variable names ¹​Start_year, ²​End_year, ³​Birth_year, ⁴​Death_year,
#   ⁵​BirthDMY, ⁶​DeathDMY

We can also form “and” statements with the & operator instead of commas:

# filters observations with "&" logical operator
# output dataframe satisfies ALL specified conditions
filter(kings, House == "Estridsen" &
                   End_year > 1100 &
                    End_year < 1210)
# A tibble: 7 × 11
  Name         House Start…¹ End_y…² Birth…³ Death…⁴ Gender Dynasty Source Birth…⁵ Death…⁶
  <chr>        <chr>   <dbl>   <dbl>   <dbl>   <dbl> <chr>  <chr>   <chr>  <chr>   <chr>  
1 Erik 1. Eje… Estr…    1095    1103    1050    1103 M      Jellin… "http… "1055"  "10/07…
2 Niels        Estr…    1104    1134    1055    1134 M      Jellin… "http… "1134"  "25/06…
3 Erik 2. Emu… Estr…    1134    1137    1065    1137 M      Jellin… "http… "1100"  "18/09…
4 Erik 3. Lam  Estr…    1137    1146    1100    1146 M      Jellin… "http… "1120"  "27/08…
5 Svend 3.; K… Estr…    1146    1157    1120      NA M      Jellin… ""     ""      ""     
6 Valdemar 1.… Estr…    1157    1182    1131    1182 M      Jellin… "http… "14/01… "12/05…
7 Knud 4.      Estr…    1182    1202    1163    1202 M      Jellin… "http… "1163"  "12/11…
# … with abbreviated variable names ¹​Start_year, ²​End_year, ³​Birth_year, ⁴​Death_year,
#   ⁵​BirthDMY, ⁶​DeathDMY

In an “or” statement, observations must meet at least one of the specified conditions. To form “or” statements we use the logical operator for “or,” which is the vertical bar (|):

# filters observations with "|" logical operator
# output dataframe satisfies AT LEAST ONE of the specified conditions
filter(kings, House == "Estridsen" | End_year < 1100)
# A tibble: 28 × 11
   Name        House Start…¹ End_y…² Birth…³ Death…⁴ Gender Dynasty Source Birth…⁵ Death…⁶
   <chr>       <chr>   <dbl>   <dbl>   <dbl>   <dbl> <chr>  <chr>   <chr>  <chr>   <chr>  
 1 Gorm den G… Gorm      936     958     908     958 M      "Jelli… https… "908"   "958"  
 2 Toke_Gorms… Gorm      985     986      NA     986 M      "Jelli… https… ""      ""     
 3 Harald 2.   Gorm     1014    1018     963      NA M      "Jelli… https… "994"   "1018" 
 4 Knud 1. de… Gorm     1018    1035      NA    1035 M      "Jelli… https… "995"   "12/11…
 5 Hardeknud   Gorm     1035    1042     995    1042 M      "Jelli… https… "1018"  "08/06…
 6 Magnus den… Fair…    1042    1047    1018    1047 M      "Jelli… https… "1024"  "25/10…
 7 Svend 2. E… Estr…    1047    1074    1024    1076 M      ""      https… "1019"  "28/04…
 8 Harald 3. … Estr…    1074    1080    1019    1080 M      "Jelli… https… "1041"  "17/04…
 9 Knud 2. de… Estr…    1080    1086    1041    1086 M      "Jelli… https… "1043"  "10/07…
10 Oluf 1. Hu… Estr…    1086    1095    1043    1095 M      "Jelli… https… "1050"  "18/08…
# … with 18 more rows, and abbreviated variable names ¹​Start_year, ²​End_year,
#   ³​Birth_year, ⁴​Death_year, ⁵​BirthDMY, ⁶​DeathDMY

Pipes

What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.

With intermediate steps, you create a temporary dataframe and use that as input to the next function, like this:

kings_male <- filter(kings, Gender == "M")
kings_new <- select(kings_male, Name:Death_year)
kings_new
# A tibble: 52 × 6
   Name               House     Start_year End_year Birth_year Death_year
   <chr>              <chr>          <dbl>    <dbl>      <dbl>      <dbl>
 1 Gorm den Gamle     Gorm             936      958        908        958
 2 Harald 1. Blåtand  Gorm              NA       NA        936        985
 3 Toke_Gormsen       Gorm             985      986         NA        986
 4 Svend 1. Tveskæg   Gorm              NA       NA         NA       1014
 5 Harald 2.          Gorm            1014     1018        963         NA
 6 Knud 1. den Store  Gorm            1018     1035         NA       1035
 7 Hardeknud          Gorm            1035     1042        995       1042
 8 Magnus den Gode    Fairhair        1042     1047       1018       1047
 9 Svend 2. Estridsen Estridsen       1047     1074       1024       1076
10 Harald 3. Hen      Estridsen       1074     1080       1019       1080
# … with 42 more rows

This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.

You can also nest functions (i.e. one function inside of another), like this:

kings_new <- select(filter(kings, House == "Estridsen"),
                         Name:End_year)

This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).

The last option, pipes, are a recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>% and are made available via the magrittr package, installed automatically with dplyr. If you use RStudio, you can type the pipe with:

  • Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.
kings %>%
    filter(House == "Gorm") %>%
    select(Name:End_year)
# A tibble: 7 × 4
  Name              House Start_year End_year
  <chr>             <chr>      <dbl>    <dbl>
1 Gorm den Gamle    Gorm         936      958
2 Harald 1. Blåtand Gorm          NA       NA
3 Toke_Gormsen      Gorm         985      986
4 Svend 1. Tveskæg  Gorm          NA       NA
5 Harald 2.         Gorm        1014     1018
6 Knud 1. den Store Gorm        1018     1035
7 Hardeknud         Gorm        1035     1042

In the above code, we use the pipe to send the kings dataset first through filter() to keep rows where House is “Gorm”, then through select() to keep only the columns from Name to End_year. Since %>% takes the object on its left and passes it as the first argument to the function on its right, we don’t need to explicitly include the dataframe as an argument to the filter() and select() functions any more.

Some may find it helpful to read the pipe like the word “then”. For instance, in the above example, we take the dataframe kings, then we filter for rows with House == "Gorm", then we select columns Name:End_year. The dplyr functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex data wrangling operations.

If we want to create a new object with this smaller version of the data, we can assign it a new name:

kings_gorm <- kings %>%
    filter(House == "Gorm") %>%
    select(Name:End_year)

kings_gorm
# A tibble: 7 × 4
  Name              House Start_year End_year
  <chr>             <chr>      <dbl>    <dbl>
1 Gorm den Gamle    Gorm         936      958
2 Harald 1. Blåtand Gorm          NA       NA
3 Toke_Gormsen      Gorm         985      986
4 Svend 1. Tveskæg  Gorm          NA       NA
5 Harald 2.         Gorm        1014     1018
6 Knud 1. den Store Gorm        1018     1035
7 Hardeknud         Gorm        1035     1042

Note that the final dataframe (kings_gorm) is the leftmost part of this expression.

Exercise 1

Using pipes, subset the kings data to include kings with non-missing reign dates, who were born in the 14th century (Birth_year > 1300 & Birth_year < 1399) and retain only the columns Name, House, Start_year and End_year.

kings %>%
    filter(Birth_year > 1300 &  Birth_year < 1399) %>%
    select(Name, House, Start_year, End_year) 
# A tibble: 5 × 4
  Name                 House     Start_year End_year
  <chr>                <chr>          <dbl>    <dbl>
1 Valdemar 3.          Estridsen       1326     1329
2 Valdemar 4. Atterdag Estridsen       1340     1375
3 Oluf 2.              Bjelbo          1375     1387
4 Margrete 1.          Estridsen       1387     1396
5 Erik 7. af Pommern   Pomerania       1396     1439

Mutate

Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns. For this we’ll use mutate().

We might be interested in the number of years each king lived, reigned, and what the mid-year of each ruler’s reign was (such as you will need to plot the rulers on a timeline):

kings %>%
  mutate(Reign_duration = End_year - Start_year)
# A tibble: 54 × 12
   Name        House Start…¹ End_y…² Birth…³ Death…⁴ Gender Dynasty Source Birth…⁵ Death…⁶
   <chr>       <chr>   <dbl>   <dbl>   <dbl>   <dbl> <chr>  <chr>   <chr>  <chr>   <chr>  
 1 Gorm den G… Gorm      936     958     908     958 M      "Jelli… https… "908"   "958"  
 2 Harald 1. … Gorm       NA      NA     936     985 M      "Jelli… https… "936"   "987"  
 3 Toke_Gorms… Gorm      985     986      NA     986 M      "Jelli… https… ""      ""     
 4 Svend 1. T… Gorm       NA      NA      NA    1014 M      "Jelli… https… "17/04… "03/02…
 5 Harald 2.   Gorm     1014    1018     963      NA M      "Jelli… https… "994"   "1018" 
 6 Knud 1. de… Gorm     1018    1035      NA    1035 M      "Jelli… https… "995"   "12/11…
 7 Hardeknud   Gorm     1035    1042     995    1042 M      "Jelli… https… "1018"  "08/06…
 8 Magnus den… Fair…    1042    1047    1018    1047 M      "Jelli… https… "1024"  "25/10…
 9 Svend 2. E… Estr…    1047    1074    1024    1076 M      ""      https… "1019"  "28/04…
10 Harald 3. … Estr…    1074    1080    1019    1080 M      "Jelli… https… "1041"  "17/04…
# … with 44 more rows, 1 more variable: Reign_duration <dbl>, and abbreviated variable
#   names ¹​Start_year, ²​End_year, ³​Birth_year, ⁴​Death_year, ⁵​BirthDMY, ⁶​DeathDMY

We may be interested in investigating whether being a member of a specific House had any effect on the duration of reign. To look at this relationship, we will first remove data from our dataset where the dates are missing. These cases are recorded as “NULL” in the dataset.

To remove these cases, we could insert a filter() in the chain:

kings %>%
    filter(!is.na(Start_year)) %>%
    mutate(Reign_duration = End_year - Start_year) %>% 
    mutate(Midyear = End_year - (End_year-Start_year) / 2 )
# A tibble: 52 × 13
   Name        House Start…¹ End_y…² Birth…³ Death…⁴ Gender Dynasty Source Birth…⁵ Death…⁶
   <chr>       <chr>   <dbl>   <dbl>   <dbl>   <dbl> <chr>  <chr>   <chr>  <chr>   <chr>  
 1 Gorm den G… Gorm      936     958     908     958 M      "Jelli… https… "908"   "958"  
 2 Toke_Gorms… Gorm      985     986      NA     986 M      "Jelli… https… ""      ""     
 3 Harald 2.   Gorm     1014    1018     963      NA M      "Jelli… https… "994"   "1018" 
 4 Knud 1. de… Gorm     1018    1035      NA    1035 M      "Jelli… https… "995"   "12/11…
 5 Hardeknud   Gorm     1035    1042     995    1042 M      "Jelli… https… "1018"  "08/06…
 6 Magnus den… Fair…    1042    1047    1018    1047 M      "Jelli… https… "1024"  "25/10…
 7 Svend 2. E… Estr…    1047    1074    1024    1076 M      ""      https… "1019"  "28/04…
 8 Harald 3. … Estr…    1074    1080    1019    1080 M      "Jelli… https… "1041"  "17/04…
 9 Knud 2. de… Estr…    1080    1086    1041    1086 M      "Jelli… https… "1043"  "10/07…
10 Oluf 1. Hu… Estr…    1086    1095    1043    1095 M      "Jelli… https… "1050"  "18/08…
# … with 42 more rows, 2 more variables: Reign_duration <dbl>, Midyear <dbl>, and
#   abbreviated variable names ¹​Start_year, ²​End_year, ³​Birth_year, ⁴​Death_year,
#   ⁵​BirthDMY, ⁶​DeathDMY

The ! symbol negates the result of the is.na() function. Thus, if is.na() returns a value of TRUE (because the Start_year is missing), the ! symbol negates this and says we only want values of FALSE, where Start_year is not missing.

Exercise 2

Load the kings data and create a new dataframe that meets the following criteria: contains the Name and House column and three new columns called Years_lived,Reign_duration, and Midyear containing values that encompass the monarch’s life, duration of reign and the middle year of their reign (end of rule year - duration/2). Only the rows where Reign_duration is greater than 25 should be shown in the final dataframe.

Hint: think about how the commands should be ordered to produce this data frame!

kings <- kings %>%
    mutate(Years_lived = Death_year - Start_year) %>% 
    mutate(Reign_duration = End_year - Start_year) %>% 
    mutate(Midyear = End_year - (End_year-Start_year) / 2 ) %>% 
    select(Name, House, Gender, Years_lived, Reign_duration, Midyear)

Split-apply-combine data analysis and the summarize() function

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.

The summarize() function

group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. So to compute the average ………:

kings %>%
    group_by(House) %>%
    summarize(mean_rule = mean(Reign_duration))
# A tibble: 8 × 2
  House                                    mean_rule
  <chr>                                        <dbl>
1 Bjelbo                                        12  
2 Estridsen                                     14.9
3 Fairhair                                       5  
4 Gorm                                          NA  
5 Oldenburg                                     25.7
6 Palatinate-Neumarkt                            8  
7 Pomerania                                     43  
8 Schleswig-Holstein-Sonderburg-Glücksborg      32  

You may also have noticed that the output from these calls doesn’t run off the screen anymore. It’s one of the advantages of tbl_df over dataframe.

You can also group by multiple columns:

kings %>%
    group_by(House, Gender) %>%
    summarize(mean_rule = mean(Reign_duration))
# A tibble: 10 × 3
# Groups:   House [8]
   House                                    Gender mean_rule
   <chr>                                    <chr>      <dbl>
 1 Bjelbo                                   M           12  
 2 Estridsen                                F            9  
 3 Estridsen                                M           15.2
 4 Fairhair                                 M            5  
 5 Gorm                                     M           NA  
 6 Oldenburg                                M           25.7
 7 Palatinate-Neumarkt                      M            8  
 8 Pomerania                                M           43  
 9 Schleswig-Holstein-Sonderburg-Glücksborg F           51  
10 Schleswig-Holstein-Sonderburg-Glücksborg M           27.2

When grouping both by House and Gender, we see rows in our table for monarchs of different gender.

Exercise 3

  1. Try summarizing the kings by the average number of years they lived, as an exercise.
kings %>%
    group_by(House) %>%
    summarize(mean_years = mean(Years_lived))
# A tibble: 8 × 2
  House                                    mean_years
  <chr>                                         <dbl>
1 Bjelbo                                         12  
2 Estridsen                                      NA  
3 Fairhair                                        5  
4 Gorm                                           NA  
5 Oldenburg                                      28.1
6 Palatinate-Neumarkt                             8  
7 Pomerania                                      63  
8 Schleswig-Holstein-Sonderburg-Glücksborg       NA  

Notice that there are a number of NAs in the result.

  1. Can you exclude the NA data from our table using a filter step?
kings %>%
    filter(!is.na(Years_lived)) %>%
    group_by(House) %>%
    summarize(mean_years = mean(Years_lived))
# A tibble: 8 × 2
  House                                    mean_years
  <chr>                                         <dbl>
1 Bjelbo                                         12  
2 Estridsen                                      17.9
3 Fairhair                                        5  
4 Gorm                                           11.8
5 Oldenburg                                      28.1
6 Palatinate-Neumarkt                             8  
7 Pomerania                                      63  
8 Schleswig-Holstein-Sonderburg-Glücksborg       27.2
  1. Can you round the mean_years so we have only whole numbers?
kings %>%
    filter(!is.na(Years_lived)) %>%
    group_by(House) %>%
    summarize(mean_years = round(mean(Years_lived),0))
# A tibble: 8 × 2
  House                                    mean_years
  <chr>                                         <dbl>
1 Bjelbo                                           12
2 Estridsen                                        18
3 Fairhair                                          5
4 Gorm                                             12
5 Oldenburg                                        28
6 Palatinate-Neumarkt                               8
7 Pomerania                                        63
8 Schleswig-Holstein-Sonderburg-Glücksborg         27

Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum and maximum years ruled for each House for each group (men or women):

kings %>%
    filter(!is.na(Reign_duration)) %>%
    group_by(House, Gender) %>%
    summarize(max_rule = max(Reign_duration),
              min_rule = min(Reign_duration))
# A tibble: 10 × 4
# Groups:   House [8]
   House                                    Gender max_rule min_rule
   <chr>                                    <chr>     <dbl>    <dbl>
 1 Bjelbo                                   M            12       12
 2 Estridsen                                F             9        9
 3 Estridsen                                M            39        2
 4 Fairhair                                 M             5        5
 5 Gorm                                     M            22        1
 6 Oldenburg                                M            60        9
 7 Palatinate-Neumarkt                      M             8        8
 8 Pomerania                                M            43       43
 9 Schleswig-Holstein-Sonderburg-Glücksborg F            51       51
10 Schleswig-Holstein-Sonderburg-Glücksborg M            43        6

It is sometimes useful to rearrange the result of a query to inspect the values. For instance, we can sort on min_rule to put the group with the shortest reign first:

kings %>%
    filter(!is.na(Reign_duration)) %>%
    group_by(House, Gender) %>%
    summarize(max_rule = max(Reign_duration),
              min_rule = min(Reign_duration)) %>%
    arrange(min_rule)
# A tibble: 10 × 4
# Groups:   House [8]
   House                                    Gender max_rule min_rule
   <chr>                                    <chr>     <dbl>    <dbl>
 1 Gorm                                     M            22        1
 2 Estridsen                                M            39        2
 3 Fairhair                                 M             5        5
 4 Schleswig-Holstein-Sonderburg-Glücksborg M            43        6
 5 Palatinate-Neumarkt                      M             8        8
 6 Estridsen                                F             9        9
 7 Oldenburg                                M            60        9
 8 Bjelbo                                   M            12       12
 9 Pomerania                                M            43       43
10 Schleswig-Holstein-Sonderburg-Glücksborg F            51       51

To sort in descending order, we need to add the desc() function. If we want to sort the results by decreasing order of minimum household size:

kings %>%
    filter(!is.na(Reign_duration)) %>%
    group_by(House, Gender) %>%
    summarize(max_rule = max(Reign_duration),
              min_rule = min(Reign_duration)) %>%
    arrange(desc(min_rule))
# A tibble: 10 × 4
# Groups:   House [8]
   House                                    Gender max_rule min_rule
   <chr>                                    <chr>     <dbl>    <dbl>
 1 Schleswig-Holstein-Sonderburg-Glücksborg F            51       51
 2 Pomerania                                M            43       43
 3 Bjelbo                                   M            12       12
 4 Estridsen                                F             9        9
 5 Oldenburg                                M            60        9
 6 Palatinate-Neumarkt                      M             8        8
 7 Schleswig-Holstein-Sonderburg-Glücksborg M            43        6
 8 Fairhair                                 M             5        5
 9 Estridsen                                M            39        2
10 Gorm                                     M            22        1

Counting

When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, dplyr provides count(). For example, if we wanted to count the number of rows of data for each House, we would do:

kings %>%
    count(House)
# A tibble: 8 × 2
  House                                        n
  <chr>                                    <int>
1 Bjelbo                                       1
2 Estridsen                                   22
3 Fairhair                                     1
4 Gorm                                         7
5 Oldenburg                                   16
6 Palatinate-Neumarkt                          1
7 Pomerania                                    1
8 Schleswig-Holstein-Sonderburg-Glücksborg     5

For convenience, count() provides the sort argument to get results in decreasing order:

kings %>%
    count(House, sort = TRUE)
# A tibble: 8 × 2
  House                                        n
  <chr>                                    <int>
1 Estridsen                                   22
2 Oldenburg                                   16
3 Gorm                                         7
4 Schleswig-Holstein-Sonderburg-Glücksborg     5
5 Bjelbo                                       1
6 Fairhair                                     1
7 Palatinate-Neumarkt                          1
8 Pomerania                                    1

Exercise 4

  1. How many Danish monarchs enjoyed a longer-than-average reign?

Suggested steps:

  • Select the Names and Reign_duration columns with select()
  • filter() the kings dataset Reign_duration > the-average-length that you calculated before and count the resulting rows with count()
# for the average years of rule
kings %>%
  
  filter(Reign_duration > 29.5) %>% 
  count()
# A tibble: 1 × 1
      n
  <int>
1    14
  • Use mutate() to create a new column AboveAverage and populate it with TRUE/FALSE which will be a response to the question of whether the Reign_duration value is bigger or smaller than average-length
  • Group the unique values in the AboveAverage column with group_by() and then count(). What is your result?
kings %>%
  mutate(AboveAverage = Reign_duration > 29) %>% 
  group_by(AboveAverage) %>% 
  count()
# A tibble: 3 × 2
# Groups:   AboveAverage [3]
  AboveAverage     n
  <lgl>        <int>
1 FALSE           38
2 TRUE            14
3 NA               2
  1. Sort kings by reign duration in ascending and descending order. Which three ruled the longest/shortest?

  2. How many days did the three longest-ruling kings rule? Use mutate() to create a column called Days calculating the total number of days they ruled. Consider transition year in your calculation! (366 days)

kings %>%
  arrange(Reign_duration)
# A tibble: 54 × 6
   Name                House                                Gender Years…¹ Reign…² Midyear
   <chr>               <chr>                                <chr>    <dbl>   <dbl>   <dbl>
 1 Toke_Gormsen        Gorm                                 M            1       1    986.
 2 Abel                Estridsen                            M            2       2   1251 
 3 Erik 2. Emune       Estridsen                            M            3       3   1136.
 4 Valdemar 3.         Estridsen                            M           38       3   1328.
 5 Christoffer 2.      Estridsen                            M            3       3   1330.
 6 Harald 2.           Gorm                                 M           NA       4   1016 
 7 Magnus den Gode     Fairhair                             M            5       5   1044.
 8 Harald 3. Hen       Estridsen                            M            6       6   1077 
 9 Knud 2. den Hellige Estridsen                            M            6       6   1083 
10 Frederik 8.         Schleswig-Holstein-Sonderburg-Glück… M            6       6   1909 
# … with 44 more rows, and abbreviated variable names ¹​Years_lived, ²​Reign_duration
  1. What is the total number of Christians and Frederiks among the Danish kings? Search on patterns among character strings can be done with grepl() or str_detect() functions and a wild card.
kings %>%
  filter(str_detect(Name, "Christian*|Fred*"))
# A tibble: 18 × 6
   Name          House                                    Gender Years_l…¹ Reign…² Midyear
   <chr>         <chr>                                    <chr>      <dbl>   <dbl>   <dbl>
 1 Christian 2.  Oldenburg                                M             46      10   1518 
 2 Frederik 1.   Oldenburg                                M             10      10   1528 
 3 Christian 3.  Oldenburg                                M             23      23   1548.
 4 Frederik 2.   Oldenburg                                M             29      29   1574.
 5 Christian 4.  Oldenburg                                M             60      60   1618 
 6 Frederik 3.   Oldenburg                                M             22      22   1659 
 7 Christian 5.  Oldenburg                                M             29      29   1684.
 8 Frederik 4.   Oldenburg                                M             34      31   1714.
 9 Christian 6.  Oldenburg                                M             16      16   1738 
10 Frederik 5.   Oldenburg                                M             20      20   1756 
11 Christian 7.  Oldenburg                                M             42      42   1787 
12 Frederik 6.   Oldenburg                                M             31      31   1824.
13 Christian 8.  Oldenburg                                M              9       9   1844.
14 Frederik 7.   Oldenburg                                M             15      15   1856.
15 Christian 9.  Schleswig-Holstein-Sonderburg-Glücksborg M             43      43   1884.
16 Frederik 8.   Schleswig-Holstein-Sonderburg-Glücksborg M              6       6   1909 
17 Christian 10. Schleswig-Holstein-Sonderburg-Glücksborg M             35      35   1930.
18 Frederik 9.   Schleswig-Holstein-Sonderburg-Glücksborg M             25      25   1960.
# … with abbreviated variable names ¹​Years_lived, ²​Reign_duration

Saving data

Now that you have nicely enriched the kings, we can save this dataframe to our data_output directory.

write_csv(kings, file = "data_output/kings_plotting.csv")

Keypoints

  • Use the dplyr package to manipulate dataframes.
  • Use select() to choose variables from a dataframe.
  • Use filter() to choose data based on values.
  • Use group_by() and summarize() to work with subsets of data.
  • Use mutate() to create new variables.